1. Upstream analysis

The upstream analysis covers everything from downloading fastq files to obtaining gene counts. Each step of the analysis is explained in detail in its own section. For reproducibility, nf-core pipelines were used for all processes in this upstream analysis. Please see the details of the nf-core setup in my computer workbench note.

Overview of upstream analysis

Transcriptome analysis starts with fastq files as input and passes them through 5 different processes:

1. Pre-alignment quality control (QC): to check the quality of the fastq files with tools such as FastQC.

Input = .fastq files
Output = QC reports; the .fastq files themselves are not modified
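As a sketch, a manual pre-alignment QC run with FastQC might look like the following (the fastq file name and output directory are placeholders; in practice nf-core runs this step for us):

```shell
# Run FastQC on a fastq file and write its reports to qc_reports/
# (sample_1.fastq.gz is a placeholder file name)
mkdir -p qc_reports
fastqc sample_1.fastq.gz --outdir qc_reports
```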

2. Trimming: to remove the adapter sequences from each read.

Input = .fastq files
Output = .fastq files with adapters removed
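For example, with Trim Galore (the trimmer used later in the nf-core/rnaseq pipeline), a manual run might be sketched as follows (the fastq file name is a placeholder):

```shell
# Trim adapters from a single-end fastq file with Trim Galore
# (sample_1.fastq.gz is a placeholder file name)
trim_galore --output_dir trimmed/ sample_1.fastq.gz
```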


3. Aligning: to map the sequencing reads.

The choice of alignment approach depends on the organism you are working with. It is mainly divided into 2 categories: organisms with and organisms without a reference genome.


#### 3.1 An organism with a reference genome: It is possible to infer which transcripts are expressed by mapping the reads to the reference genome (genome mapping) or to the transcriptome (transcriptome mapping).

To understand the difference between genome and transcriptome mapping, first go back to the central dogma, as shown in the picture below. RNA-seq reads can be aligned directly to a reference transcriptome. However, transcriptome mapping is not suitable for discovering novel isoforms and splice junctions. Therefore, the choice depends on the research question.

Credit pic: Bioinformatics Tools slide by P. Kueanjinda


#### 3.2 An organism without a reference genome: The reads need to be assembled first into longer contigs (de novo assembly).

In my case, I want to quantify gene expression in human samples without being concerned about novel isoforms and splice junctions. Therefore, transcriptome mapping is sufficient for my purpose.
I chose STAR as my alignment tool. Mapping will be done on the genome/transcriptome, and we also want to know which location (gene) each read maps to. Therefore, the additional files needed for the alignment process are a reference genome and a gene annotation.


If we performed this alignment process manually with STAR, we would need 2 main steps:
- Generating a genome index: using --runMode genomeGenerate
- Reference genome alignment: the genome index from the previous step is used as one of the inputs for this step.
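A sketch of those two steps run manually (file names, paths, and thread counts are placeholders; the nf-core pipeline handles this for us):

```shell
# Step 1: generate the genome index
STAR --runMode genomeGenerate \
     --genomeDir star_index \
     --genomeFastaFiles Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa \
     --sjdbGTFfile Homo_sapiens.GRCh38.gtf \
     --runThreadN 8

# Step 2: align the reads using the index from step 1
# (--quantMode TranscriptomeSAM also writes a transcriptome-space BAM,
#  which RSEM uses for quantification in the next step)
STAR --runMode alignReads \
     --genomeDir star_index \
     --readFilesIn sample_1.fastq.gz \
     --readFilesCommand zcat \
     --quantMode TranscriptomeSAM \
     --outSAMtype BAM SortedByCoordinate \
     --runThreadN 8
```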

Therefore, the input and output for the STAR tool are:
- Input = sample files (.fastq files), reference genome (FASTA file), gene annotation (GTF)
- Output = SAM or BAM files of the aligned reads

4. Counting (Quantification)

The mapped reads are next quantified to measure gene expression. I will use RSEM for this process. If we performed this quantification manually with RSEM, we would need 2 main steps:
- Reference preparation: using rsem-prepare-reference
- Expression calculation: the RSEM reference from the previous step is used as one of the inputs for rsem-calculate-expression
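A sketch of those two steps run manually (file names and the reference prefix are placeholders; the input BAM is the transcriptome-space BAM produced by STAR):

```shell
# Step 1: prepare the RSEM reference from the genome FASTA and GTF annotation
rsem-prepare-reference --gtf Homo_sapiens.GRCh38.gtf \
    Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa rsem_ref/human

# Step 2: calculate expression from the STAR transcriptome BAM (single-end data)
rsem-calculate-expression --bam --no-bam-output \
    sample.Aligned.toTranscriptome.out.bam rsem_ref/human sample
```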

Therefore, input and output for RSEM tool are
- Input = mapped read from sample (BAM files), reference genome (fasta file), gene annotation (GTF)
- Output = Gene-level results (rsem.genes.results), Isoform-level results (rsem.isoforms.results)

5. Post-alignment quality control

Here, RSeQC will be used, and we will review the QC results from all post-alignment files with MultiQC, which provides an .html report.

Perform upstream analysis

1. Downloading fastq files from GEO

The input files for the upstream analysis of transcriptomics are the fastq files. For this example, I will use the data from GSE197576. The GSE includes all data from the study, but we do not need all of it. Here, I want to compare the normoxic and hypoxic conditions. Therefore, I include only the control cells (sgCTRL) from the hypoxic (GSM5920765, GSM5920766) and normoxic (GSM5920759, GSM5920760) conditions.

With the nf-core/fetchngs pipeline, the default download method (FTP) will be used.

To download the fastq files mentioned above, we have to create a file named ids.csv that contains all the GSM accessions we want to download.
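One way to create that file, sketched with a heredoc (the directory layout matches mine; adjust it to yours):

```shell
# Create ids.csv containing the four GSM accessions, one per line
mkdir -p data/fetchngs
cat > data/fetchngs/ids.csv <<'EOF'
GSM5920765
GSM5920766
GSM5920759
GSM5920760
EOF
```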

# Go to directory where `ids.csv` file was saved
cd data/fetchngs

# Observing the `ids.csv` file
cat ids.csv

Next, I define the needed information in the nf-core/fetchngs launch pipeline. After launching the workflow, I got an nf-params.json file, which I downloaded in order to run the pipeline locally.
Here is my nf-params.json. Please note that we will run all processes on a High-Performance Computing (HPC) cluster, so the file paths are full HPC paths.

# Go to directory where the `nf-params.json` file was saved
cd data/fetchngs

# Observing the `nf-params.json` file
cat nf-params.json

Now, all the information is ready to run nf-core/fetchngs and download the 4 fastq files. To start the nf-core/fetchngs process, I run the following command.

nextflow run nf-core/fetchngs -r 1.12.0 -name Hypox -params-file nf-params.json -profile docker

The process will start, as shown in the picture below, and the status of each step is updated in real time.


After the pipeline finishes successfully, we will find the 4 fastq files, as shown below.

2. Downloading reference genome and genome annotation

As mentioned above, aligning against a reference genome/transcriptome requires a reference genome and a gene annotation.
Here, we will download the latest version of both using the commands below.

# Find the latest Ensembl release
latest_release=$(curl -s 'https://rest.ensembl.org/info/software?content-type=application/json' | grep -o '"release":[0-9]*' | cut -d: -f2)

# Download the human (Homo_sapiens.GRCh38) reference genome
wget ftp://ftp.ensembl.org/pub/release-${latest_release}/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz

# Download the human (Homo_sapiens.GRCh38) gene annotation
wget ftp://ftp.ensembl.org/pub/release-${latest_release}/gtf/homo_sapiens/Homo_sapiens.GRCh38.${latest_release}.gtf.gz

3. Performing upstream analysis

The upstream analysis was performed on HPC using a nf-core pipeline called rnaseq.

From the nf-core/rnaseq pipeline, we will mainly use the dark-blue path, where Trim Galore, STAR, and RSEM are used for trimming, alignment, and quantification, respectively. I chose those tools, as mentioned above, following the recommendation in the STAT115 course taught by Prof. Xiaole Shirley Liu, Harvard University.

To analyze a set of .fastq files, we have to create a file named samplesheet.csv that includes the information about our .fastq files.
In my case, the data are single-end sequencing reads, so there is only one fastq file per replicate of each sample.
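A sketch of what such a samplesheet might contain for single-end data. The nf-core/rnaseq samplesheet uses the columns sample, fastq_1, fastq_2, and strandedness, with fastq_2 left empty for single-end reads; the sample names and fastq paths below are placeholders to adapt to your own files:

```shell
# Create a minimal single-end samplesheet (fastq_2 is left empty;
# sample names and fastq paths are placeholders)
mkdir -p data/rnaseq
cat > data/rnaseq/samplesheet.csv <<'EOF'
sample,fastq_1,fastq_2,strandedness
hypoxia_rep1,data/fetchngs/GSM5920765.fastq.gz,,auto
hypoxia_rep2,data/fetchngs/GSM5920766.fastq.gz,,auto
normoxia_rep1,data/fetchngs/GSM5920759.fastq.gz,,auto
normoxia_rep2,data/fetchngs/GSM5920760.fastq.gz,,auto
EOF
```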

# Go to the directory where the `samplesheet.csv` file was saved
cd data/rnaseq

# Observing the `samplesheet.csv` file
cat samplesheet.csv

Next, I define the needed information in the nf-core/rnaseq launch pipeline. After launching the workflow, I got an nf-params.json file, which I downloaded in order to run the analysis locally.
Here is my nf-params.json. Please note that we will run all processes on a High-Performance Computing (HPC) cluster, so the file paths are full HPC paths.

# Go to directory where the `nf-params.json` file was saved
cd data/rnaseq

# Observing the `nf-params.json` file
cat nf-params.json

Now, all the information is ready to run nf-core/rnaseq and analyze the 4 fastq files through to gene counts. To start the nf-core/rnaseq process, I run the following command.

nextflow run nf-core/rnaseq -r 3.14.0 -name HYPOX -resume -params-file nf-params.json -profile docker

The process will start, as shown in the picture below, and the status of each step is updated in real time.

We can observe QC for both pre- and post-alignment, as well as many more visualizations created by the tools in the nf-core/rnaseq pipeline, from the MultiQC .html file.

MultiQC file

2. Downstream analysis


2.1 Observing the count matrix file in R

dplyr and readr are the R libraries needed in this section, so I load them.

library(dplyr)
library(readr)

raw_counts <- read_tsv("/home/rstudio/R_docker/port_RNA/MasteringR/From_HPC/STAR_RSEM/rsem.merged.gene_counts.tsv")

# observing head of the data
head(raw_counts)

# Convert the count values to integers, as required by DESeq2 input
# (columns 3 to 6 are the sample count columns)
raw_counts[, 3:6] <- apply(raw_counts[, 3:6], 2, as.integer)

# Alternatively, convert the count columns to integers with dplyr
raw_counts_int <- raw_counts %>%
  mutate(across(starts_with("sgCTRL"), as.integer))  # adjust the prefix to match your column names

# Print the modified data frame
print(raw_counts_int)

2.2 Map the ENSEMBL IDs to the official gene symbols

The gene_id column provides the ENSEMBL IDs; we next need to convert them to a readable form, such as gene symbols, using the org.Hs.eg.db Bioconductor package.

library(org.Hs.eg.db)

map <- AnnotationDbi::select(org.Hs.eg.db,
                             keys = raw_counts$gene_id, # the keys to convert
                             keytype = "ENSEMBL",       # key type of the input column
                             columns = "SYMBOL")        # what we want to retrieve

head(map)

If you are not sure how each data type is named for keytype and columns, you can list the valid values with the following R code.

columns(org.Hs.eg.db)
keytypes(org.Hs.eg.db)

2.3 Join the gene counts and the gene symbols (map)

We next want raw_counts annotated with the gene symbols. To do this, we join raw_counts with map by matching the ENSEMBL IDs. Joining the two requires a shared column, which is ENSEMBL in this case. Therefore, we first have to rename the column.

library(dplyr)

# Change the column name to `ENSEMBL`
colnames(raw_counts)[colnames(raw_counts) == "gene_id"] <- "ENSEMBL"

# Join `raw_counts` with `map` by the shared `ENSEMBL` column
raw_counts_map <- left_join(raw_counts, map, by = "ENSEMBL")
head(raw_counts_map)

# Write `raw_counts_map` as a .csv file
write.csv(raw_counts_map, file = "result/HypoxUpstream_raw_counts_map.csv", row.names = TRUE)

2.4 Manage the data into matrix format

We will drop the identifier columns first, then add the gene symbols back as row names.

# Drop the `ENSEMBL`, `transcript_id`, and `SYMBOL` columns (1, 2, and 7)
# and convert the gene count columns to a matrix.
raw_counts_mat <- raw_counts_map[, -c(1, 2, 7)] %>% 
                    as.matrix()

# Add the gene symbols back as row names.
rownames(raw_counts_mat) <- raw_counts_map$SYMBOL
head(raw_counts_mat)

# Write `raw_counts_mat` as a .csv file
write.csv(raw_counts_mat, file = "result/HypoxUpstream_raw_counts_mat.csv", row.names = TRUE)

I will subset only the common genes (common_genes): the genes found both in our data (the row names of raw_counts_mat) and in the gene symbol annotation (map$SYMBOL).

common_genes <- intersect(rownames(raw_counts_mat), map$SYMBOL)

head(common_genes)

Here is map, which contains the ENSEMBL ID and gene symbol information, subset and re-ordered according to common_genes.

# Select only the common genes and re-order them by common_genes
map <- map %>%
  dplyr::slice(match(common_genes, SYMBOL))

head(map)

Here is raw_counts_subset, which contains the gene symbols and gene counts subset to common_genes.

# Subset to the common genes

raw_counts_df <- as.data.frame(raw_counts_mat)
raw_counts_subset <- raw_counts_df %>% subset(rownames(raw_counts_df) %in% common_genes) %>% as.matrix() 

head(raw_counts_subset)

1. Call the DESeq2 library.

library(DESeq2)

2. Manage the file format for DESeq2 analysis

# Define condition of each column
coldata <- data.frame(condition = c("hypoxia", "hypoxia", "normoxia", "normoxia"))
rownames(coldata) <- colnames(raw_counts_mat)

# Recheck that the sample names match the defined conditions
coldata

3. Differential gene expression analysis by DESeq2

This starts by creating a DESeq2 object (dds), with raw_counts_subset and coldata used as input. This object uses condition as the main design factor.

dds <- DESeqDataSetFromMatrix(countData = raw_counts_subset,
                              colData = coldata,
                              design = ~ condition)
dds <- DESeq(dds)
dds

To get the differential gene expression, the two conditions will be contrasted (hypoxia vs normoxia).

res <- results(dds, contrast = c("condition", "hypoxia", "normoxia"))

res

4. Observe the DESeq2 results by visualization

4.1 Volcano plot

The volcano plot will label the most differentially expressed genes, whether up- or downregulated. Thus, I first define the genes to label, genes_to_label.

genes_to_label <- res %>%
  as.data.frame() %>%
  tibble::rownames_to_column(var = "gene") %>%
  filter(!stringr::str_detect(gene, "LOC"),
         abs(log2FoldChange) >= 2.5,
         padj <= 0.001)

head(genes_to_label)

Moreover, I also want to mark the significantly different genes by color. Therefore, the significant (sig) and not significant (not_sig) genes are defined by the code below for the coloring purpose.

res_label <- res %>%
  as.data.frame() %>%
  tibble::rownames_to_column(var = "gene") %>%
  mutate(sig = case_when(
    !stringr::str_detect(gene, "LOC") &
      abs(log2FoldChange) >= 2.5 &
      padj <= 0.001 ~ "sig",
    TRUE ~ "not_sig"
  ))

head(res_label)

Now, I create the volcano plot using the ggplot2 library, with ggrepel to prevent label overlap. Additionally, I add horizontal and vertical lines to mark the significance thresholds based on the p-value and the fold change, respectively.

library(ggplot2)
library(ggrepel)

ggplot(res_label, aes(x = log2FoldChange, y = -log10(pvalue))) +
  geom_point(aes(color = sig)) +
  scale_color_manual(values = c("blue", "red")) +
  ggrepel::geom_label_repel(data = genes_to_label, aes(label = gene))+
  geom_hline(yintercept = 100, linetype = 2, color = "red") +
  geom_vline(xintercept = c(-2.5, 2.5), linetype = 2, color = "red")+
  theme_bw(base_size = 14)

4.2 Principal Component Analysis (PCA)

# Calculate the normalized data for PCA plot
vsd <- vst(dds, blind=FALSE)

normalized_counts <- assay(vsd) %>% as.matrix()

# Calculate the principal components
pca_prcomp <- prcomp(t(normalized_counts), center = TRUE, scale. = FALSE)

# This is the PC values  
pca_prcomp$x
# Create a DataFrame for visualization
PC1_and_PC2 <- data.frame(PC1 = pca_prcomp$x[, 1], PC2 = pca_prcomp$x[, 2], type = rownames(pca_prcomp$x))

PC1_and_PC2

This will generate the PCA plot.

# Create a PCA plot
ggplot(PC1_and_PC2, aes(x = PC1, y = PC2, col = type)) + 
  geom_point() + 
  geom_text(aes(label = type), hjust = 0, vjust = 0) +
  coord_fixed()+
  theme_classic()

4.3 Heatmap

significant_genes <- res %>%
  as.data.frame() %>%
  filter(padj <= 0.01, abs(log2FoldChange) >= 2) %>%
  rownames()

head(significant_genes, 10)

I will use ComplexHeatmap for the heatmap plotting.

library(ComplexHeatmap)
significant_mat <- normalized_counts[significant_genes, ]

col_anno <- HeatmapAnnotation(
  df = coldata,
  col = list(condition = c("hypoxia" = "red", "normoxia" = "blue"))
)

Heatmap(
  t(scale(t(significant_mat))),
  top_annotation = col_anno,
  show_row_names = FALSE,
  name = "scaled normalized\nexpression"
)

5. Over-Representation Analysis (ORA)

Over-Representation Analysis (ORA) is a method that compares a list of genes of interest (e.g., DEGs) to a background set (all genes detected in the experiment) to identify pathways that are over-represented among the genes of interest.

5.1 Conversion of Gene Symbols

library(clusterProfiler)

#convert gene symbol to Entrez ID 

significant_genes_map <- clusterProfiler::bitr(geneID = significant_genes,
                                               fromType = "SYMBOL",
                                               toType = "ENTREZID",
                                               OrgDb = "org.Hs.eg.db")

head(significant_genes_map)

5.2 Selecting Background Genes

## background genes are genes that are detected in the RNAseq experiment 
background_genes<- res %>% 
  as.data.frame() %>% 
  filter(baseMean != 0) %>%
  tibble::rownames_to_column(var = "gene") %>%
  pull(gene)


res_df<- res %>% 
  as.data.frame() %>% 
  filter(baseMean != 0) %>%
  tibble::rownames_to_column(var = "gene")

background_genes_map<- bitr(geneID = background_genes, 
                            fromType="SYMBOL", 
                            toType="ENTREZID",
                      OrgDb="org.Hs.eg.db")

head(background_genes_map)

5.3 Enrichment Analysis and visualization

5.3.1 Gene Ontology (GO) enrichment analysis

ego <- enrichGO(gene          = significant_genes_map$ENTREZID,
                universe      = background_genes_map$ENTREZID,
                OrgDb         = org.Hs.eg.db,
                ont           = "BP",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.01,
                qvalueCutoff  = 0.05,
                readable      = TRUE)

head(ego)
dotplot(ego)

5.3.2 Enrichment analysis using the MSigDB database.

# Load the msigdbr package
library(msigdbr)

# Retrieve all MSigDB gene sets for Homo sapiens
m_df <- msigdbr(species = "Homo sapiens")

# The m_df object now contains information about all MSigDB gene sets.
head(m_df)

Let’s use the msigdbr function with the species parameter set to “Homo sapiens” and the category parameter set to “H” to retrieve Hallmark gene sets.

m_t2g <- msigdbr(species = "Homo sapiens", category = "H") %>% 
  dplyr::select(gs_name, entrez_gene)

To get an overview of the number of genes in each gene set, we can create a table:

# Count the number of genes in each gene set
table(m_t2g$gs_name)

Now, let's perform enrichment analysis using these sets.

We'll use the set of significant genes from our analysis (significant_genes_map$ENTREZID) and the Hallmark gene sets (m_t2g). This analysis will help us identify which Hallmark gene sets are over-represented among our significant genes.

em <- enricher(significant_genes_map$ENTREZID, TERM2GENE = m_t2g, 
               universe = background_genes_map$ENTREZID)
head(em)
dotplot(em)

5.4 Gene Set Enrichment Analysis (GSEA)

5.4.1 Accessing Gene Sets

We reuse the Hallmark gene sets (m_t2g) retrieved from MSigDB above.

5.4.2 Conversion of Gene Symbols

We reuse the symbol-to-ENTREZID mapping of the background genes (background_genes_map) created in section 5.2.

5.4.3 Enrichment Analysis

Unlike ORA, GSEA uses all detected genes, ranked by a signed statistic: sign(log2FoldChange) * -log10(pvalue). The ranked list is built below and passed to GSEA().

# Join the ENTREZ IDs and compute the signed ranking statistic,
# capping infinite -log10(p-values) at 1000
res_df <- res_df %>%
  left_join(background_genes_map, by = c("gene" = "SYMBOL")) %>%
  mutate(negative_log10pvalue = -log10(pvalue),
         negative_log10pvalue = ifelse(is.infinite(negative_log10pvalue), 1000, negative_log10pvalue),
         signed_rank_stats = sign(log2FoldChange) * negative_log10pvalue) %>%
  arrange(desc(signed_rank_stats))

# Named, decreasingly sorted vector required by GSEA()
gene_list <- res_df$signed_rank_stats
names(gene_list) <- res_df$ENTREZID

em2 <- GSEA(gene_list, TERM2GENE = m_t2g)

head(em2)
em2@result 
# save visualization in p1

p1<- gseaplot(em2, geneSetID = "HALLMARK_G2M_CHECKPOINT", 
              by = "runningScore", title = "HALLMARK_G2M_CHECKPOINT")
p1
# save visualization in p2

p2 <- gseaplot(em2, geneSetID = "HALLMARK_HYPOXIA", 
               by = "runningScore", title = "HALLMARK_HYPOXIA")

p2